Python Data Mining Quick Start Guide by Nathan Greeneltch

Author: Nathan Greeneltch
Language: eng
Format: epub
Tags: COM018000 - COMPUTERS / Data Processing, COM062000 - COMPUTERS / Data Modeling and Design, COM089000 - COMPUTERS / Data Visualization
Publisher: Packt
Published: 2019-04-24T11:20:04+00:00


PCA

PCA reduces the dimensionality of data in an unsupervised manner. The method's goal is to identify new feature vectors that maximize the variance in the data, and then project the original data into the space those vectors define. Please revisit the short example in the previous section for an intuitive description.
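The two steps above (find high-variance directions, then project) can be sketched by hand with NumPy. This is only a minimal illustration under stated assumptions: the toy data is made up for the example, and eigen-decomposing the covariance matrix is one common way to compute the directions, not necessarily what any particular library does internally.

```python
import numpy as np

# Hypothetical toy data: 200 points stretched along a diagonal direction.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x, 0.5 * x + rng.normal(scale=0.2, size=200)])

# 1. Center each feature so variance is measured around the mean.
centered = data - data.mean(axis=0)

# 2. Eigen-decompose the covariance matrix of the features.
cov = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 3. Sort directions from most to least variance (eigh returns ascending order).
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# 4. Project the centered data onto the new directions.
projected = centered @ eigenvectors

print("variance along each new axis:", projected.var(axis=0, ddof=1))
```

The variance along each new axis equals the corresponding eigenvalue, which is why the eigenvalues can be used to rank the directions.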

The new feature vectors that maximize variance are the eigenvectors of the data's covariance matrix, and they are called the principal components. There is one component for each original feature. The power of this method comes when you drop the less important components and keep only the most informative ones, thus lowering the number of dimensions. A fitted scikit-learn PCA object has an explained_variance_ attribute that can be used to rank the importance of each principal component. More commonly in data mining, you will use the n_components argument to specify a new, lower number of dimensions and let scikit-learn sort the components by variance and drop the rest automatically.
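As a rough sketch of both approaches, the snippet below first fits PCA with every component to inspect explained_variance_, then refits with n_components=2. The choice of the iris data and of two components is an assumption made for illustration; explained_variance_ratio_ is a related scikit-learn attribute not mentioned in the text above.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data  # 150 samples, 4 features

# Fit with all components to see how the variance is distributed.
pca_full = PCA().fit(X)
print(pca_full.explained_variance_)        # variance per component, largest first
print(pca_full.explained_variance_ratio_)  # same values as fractions of the total

# Keep only the two highest-variance components.
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X)
print(X_2d.shape)
```

If the first few ratios already account for most of the total, a small n_components is likely to lose little information.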

In the PCA example figure, the raw scatter plot of the iris dataset is on the left. The most variation is captured in the direction of the red arrow ("PCA1"), and the second most is captured in the orthogonal direction shown by the black arrow ("PCA2"). Now imagine rotating the dataset so that the two axes are the first two principal components. Finally, study the PCA scatter plot on the right, where the axes are the directions "PCA1" and "PCA2".

The connection between the right and left scatters should be clear in your mind before you move on from this section. It's this kind of intuition that will allow you to do powerful analysis while also knowing what the underlying mathematics is doing. The methods in this book are not black boxes, and you should force yourself to learn and understand them. You almost certainly do yourself a disservice as a data mining practitioner otherwise.
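The link between the two scatter plots can also be checked numerically. The sketch below assumes two iris features (sepal length and petal length) as stand-ins for the axes of the raw plot, since the features used in the original figure are not stated here. It confirms that the two directions are orthogonal and that keeping both components only re-centers and rotates the data (possibly with a flip), so no variance is lost.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Assumption: these two columns stand in for the axes of the raw scatter plot.
X = load_iris().data[:, [0, 2]]  # sepal length, petal length (cm)

pca = PCA(n_components=2)
X_rot = pca.fit_transform(X)

# Rows of components_ are the PCA1 and PCA2 directions in the original axes.
pca1, pca2 = pca.components_
print("PCA1 . PCA2 =", np.dot(pca1, pca2))  # close to 0: orthogonal directions

# Total variance is the same before and after the change of axes.
total_before = X.var(axis=0, ddof=1).sum()
total_after = X_rot.var(axis=0, ddof=1).sum()
print(total_before, total_after)

# In the new axes the two coordinates are uncorrelated.
print(np.cov(X_rot, rowvar=False)[0, 1])
```

Plotting X on the left and X_rot on the right would give a picture analogous to the one described above, up to the choice of features.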


